Python for DS - Lab 3¶
Solal ZANA and Adam BERDAH¶
Exercise 1: Training a model and tuning its hyperparameter¶
Q1¶
Predicting the number of passengers on a given flight is a regression problem because the target variable is continuous. This is also a supervised learning problem, as the model would be trained on historical data with known numbers of passengers to be able to learn to predict that value based on new inputs.
Q2¶
import pandas as pd
import numpy as np
df = pd.read_csv("/Users/solalzana/Desktop/DSB PDS/Lab3 - sklearn & hyperparams tuning/data/train.csv.bz2")
df.dropna(inplace=True) # drop rows with missing values, if any (this dataset has none)
df
| DateOfDeparture | Departure | Arrival | WeeksToDeparture | log_PAX | std_wtd | |
|---|---|---|---|---|---|---|
| 0 | 2012-06-19 | ORD | DFW | 12.875000 | 12.331296 | 9.812647 |
| 1 | 2012-09-10 | LAS | DEN | 14.285714 | 10.775182 | 9.466734 |
| 2 | 2012-10-05 | DEN | LAX | 10.863636 | 11.083177 | 9.035883 |
| 3 | 2011-10-09 | ATL | ORD | 11.480000 | 11.169268 | 7.990202 |
| 4 | 2012-02-21 | DEN | SFO | 11.450000 | 11.269364 | 9.517159 |
| ... | ... | ... | ... | ... | ... | ... |
| 8897 | 2011-10-02 | DTW | ATL | 9.263158 | 10.427055 | 7.316967 |
| 8898 | 2012-09-25 | DFW | ORD | 12.772727 | 12.201552 | 10.641034 |
| 8899 | 2012-01-19 | SFO | LAS | 11.047619 | 10.508746 | 7.908705 |
| 8900 | 2013-02-03 | ORD | PHL | 6.076923 | 10.174042 | 4.030334 |
| 8901 | 2011-11-26 | DTW | ATL | 9.526316 | 9.202674 | 6.167733 |
8902 rows × 6 columns
The columns have different types: DateOfDeparture holds dates (read as strings until converted with pd.to_datetime), Departure and Arrival are categorical variables stored as strings, and WeeksToDeparture, log_PAX and std_wtd are numerical (float) variables.
Q3¶
from pandas.tseries.holiday import USFederalHolidayCalendar
df['DateOfDeparture'] = pd.to_datetime(df['DateOfDeparture']) # Convert to proper dates
# Create new columns
df['day'] = df['DateOfDeparture'].dt.day
df['weekday'] = df['DateOfDeparture'].dt.weekday + 1
df['week'] = df['DateOfDeparture'].dt.isocalendar().week
df['month'] = df['DateOfDeparture'].dt.month
df['year'] = df['DateOfDeparture'].dt.year
holidays = USFederalHolidayCalendar().holidays(start=min(df['DateOfDeparture']), end=max(df['DateOfDeparture']))
df['holiday'] = (df['DateOfDeparture'].isin(holidays)).astype(int)
df['weekend'] = (df['DateOfDeparture'].dt.weekday >= 5).astype(int) # 1 if it's Saturday or Sunday
print((df['holiday'] == 1).sum()) # 309 rows (flights) fall on a US federal holiday
print((df['weekend'] == 1).sum()) # 2497 rows (flights) fall on a weekend
df
309
2497
| DateOfDeparture | Departure | Arrival | WeeksToDeparture | log_PAX | std_wtd | day | weekday | week | month | year | holiday | weekend | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-06-19 | ORD | DFW | 12.875000 | 12.331296 | 9.812647 | 19 | 2 | 25 | 6 | 2012 | 0 | 0 |
| 1 | 2012-09-10 | LAS | DEN | 14.285714 | 10.775182 | 9.466734 | 10 | 1 | 37 | 9 | 2012 | 0 | 0 |
| 2 | 2012-10-05 | DEN | LAX | 10.863636 | 11.083177 | 9.035883 | 5 | 5 | 40 | 10 | 2012 | 0 | 0 |
| 3 | 2011-10-09 | ATL | ORD | 11.480000 | 11.169268 | 7.990202 | 9 | 7 | 40 | 10 | 2011 | 0 | 1 |
| 4 | 2012-02-21 | DEN | SFO | 11.450000 | 11.269364 | 9.517159 | 21 | 2 | 8 | 2 | 2012 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8897 | 2011-10-02 | DTW | ATL | 9.263158 | 10.427055 | 7.316967 | 2 | 7 | 39 | 10 | 2011 | 0 | 1 |
| 8898 | 2012-09-25 | DFW | ORD | 12.772727 | 12.201552 | 10.641034 | 25 | 2 | 39 | 9 | 2012 | 0 | 0 |
| 8899 | 2012-01-19 | SFO | LAS | 11.047619 | 10.508746 | 7.908705 | 19 | 4 | 3 | 1 | 2012 | 0 | 0 |
| 8900 | 2013-02-03 | ORD | PHL | 6.076923 | 10.174042 | 4.030334 | 3 | 7 | 5 | 2 | 2013 | 0 | 1 |
| 8901 | 2011-11-26 | DTW | ATL | 9.526316 | 9.202674 | 6.167733 | 26 | 6 | 47 | 11 | 2011 | 0 | 1 |
8902 rows × 13 columns
# !pip install ydata_profiling
from ydata_profiling import ProfileReport
ProfileReport(df, title="Sample Data Report")
Q4¶
# Selecting numerical features
df.dtypes
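# 'week' is not selected here: isocalendar().week has the UInt32 extension dtype, which 'int' does not match
numerical_features = df.select_dtypes(include=['int', 'float']).columns.tolist() # tolist to remove the Index() when printing
print(f"Numerical features : {numerical_features}")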
df2 = df.select_dtypes(include=['int', 'float'])
df2
Numerical features : ['WeeksToDeparture', 'log_PAX', 'std_wtd', 'day', 'weekday', 'month', 'year', 'holiday', 'weekend']
| WeeksToDeparture | log_PAX | std_wtd | day | weekday | month | year | holiday | weekend | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.875000 | 12.331296 | 9.812647 | 19 | 2 | 6 | 2012 | 0 | 0 |
| 1 | 14.285714 | 10.775182 | 9.466734 | 10 | 1 | 9 | 2012 | 0 | 0 |
| 2 | 10.863636 | 11.083177 | 9.035883 | 5 | 5 | 10 | 2012 | 0 | 0 |
| 3 | 11.480000 | 11.169268 | 7.990202 | 9 | 7 | 10 | 2011 | 0 | 1 |
| 4 | 11.450000 | 11.269364 | 9.517159 | 21 | 2 | 2 | 2012 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8897 | 9.263158 | 10.427055 | 7.316967 | 2 | 7 | 10 | 2011 | 0 | 1 |
| 8898 | 12.772727 | 12.201552 | 10.641034 | 25 | 2 | 9 | 2012 | 0 | 0 |
| 8899 | 11.047619 | 10.508746 | 7.908705 | 19 | 4 | 1 | 2012 | 0 | 0 |
| 8900 | 6.076923 | 10.174042 | 4.030334 | 3 | 7 | 2 | 2013 | 0 | 1 |
| 8901 | 9.526316 | 9.202674 | 6.167733 | 26 | 6 | 11 | 2011 | 0 | 1 |
8902 rows × 9 columns
Q5¶
The Root Mean Squared Error (RMSE) is the square root of the average squared difference between the values predicted by our model and the actual values. Because the differences are squared, it is sensitive to large errors: it penalizes them more heavily than small ones, which makes it very useful when avoiding large errors is our priority. It is also relevant because it is one of the most commonly used metrics for regression tasks and its value is expressed in the same unit as our target variable.
For this regression task, we will use the built-in sklearn function to compute the RMSE.
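To make the "large errors weigh more" point concrete, here is a minimal sketch on made-up numbers (not taken from the flight data). The helper names `rmse` and `mae` are ours, chosen for illustration; the comparison with the mean absolute error is only there as a reference point.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, expressed in the unit of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, used here only as a point of comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([10.0, 10.0, 10.0, 10.0])
small_errors = y_true + np.array([1.0, -1.0, 1.0, -1.0])  # four errors of size 1
one_big_error = y_true + np.array([0.0, 0.0, 0.0, 4.0])   # a single error of size 4

# Both prediction vectors have the same MAE, but the RMSE doubles
# when the total error is concentrated on one observation.
print(mae(y_true, small_errors), rmse(y_true, small_errors))
print(mae(y_true, one_big_error), rmse(y_true, one_big_error))
```

On real predictions, `sklearn.metrics.root_mean_squared_error` computes the same quantity as the `rmse` helper.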
Q6¶
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import root_mean_squared_error
%matplotlib inline
import matplotlib.pyplot as plt
X = df2.drop(columns=['log_PAX'])
y = df2['log_PAX']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4839) # train-test split
# Training the DecisionTree regressor with different max_depth values and calculating RMSE
train_rmse = []
test_rmse = []
max_depth_range = range(1, 21)
for max_depth in max_depth_range:
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=4839)
    model.fit(X_train, y_train)
    # Predict on training and test sets
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    # Calculate RMSE for train and test sets
    train_rmse.append(root_mean_squared_error(y_train, y_train_pred))
    test_rmse.append(root_mean_squared_error(y_test, y_test_pred))
# Plotting RMSE as a function of max_depth
fig, ax1 = plt.subplots(figsize=(10, 6))
plt.plot(max_depth_range, train_rmse, label='Train RMSE', marker='o')
plt.plot(max_depth_range, test_rmse, label='Test RMSE', marker='o')
plt.xlabel('max_depth')
plt.ylabel('RMSE')
plt.title('RMSE on Train and Test Sets as a Function of max_depth')
plt.legend()
plt.grid()
plt.show()
optimal_depth = np.argmin(test_rmse) + 1 # argmin returns a 0-based index, while max_depth_range starts at 1
optimal_depth
7
Q7¶
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Without Standard Scaler
pipeline_no_scaler = Pipeline([
('regressor', DecisionTreeRegressor(max_depth=optimal_depth, random_state=4839)) # to avoid unnecessary computations
])
# With Standard Scaler
pipeline_with_scaler = Pipeline([
('scaler', StandardScaler()),
('regressor', DecisionTreeRegressor(max_depth=optimal_depth, random_state=4839)) # to avoid unnecessary computations
])
pipeline_no_scaler.fit(X_train, y_train)
pipeline_with_scaler.fit(X_train, y_train)
y_pred_no_scaler = pipeline_no_scaler.predict(X_test)
y_pred_with_scaler = pipeline_with_scaler.predict(X_test)
rmse_no_scaler = root_mean_squared_error(y_test, y_pred_no_scaler)
rmse_with_scaler = root_mean_squared_error(y_test, y_pred_with_scaler)
print(f"RMSE without StandardScaler: {rmse_no_scaler:.4f}")
print(f"RMSE with StandardScaler: {rmse_with_scaler:.4f}")
RMSE without StandardScaler: 0.7886
RMSE with StandardScaler: 0.7886
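We notice that both errors are the same. This is because decision trees split on one feature at a time by comparing it to a threshold. Standard scaling is an increasing affine transformation of each feature, so it preserves the ordering of the values and therefore the candidate splits: the tree partitions the data in the same way and the model output is unchanged.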
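The same behaviour can be checked on synthetic data. The sketch below is an illustration under our own assumptions (random features on very different scales, an arbitrary target, `max_depth=4`); it compares the predictions of a tree fitted on raw features with those of a tree fitted after `StandardScaler`.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# Two synthetic features on very different scales (illustrative data only)
X_demo = np.column_stack([rng.uniform(0, 1, 300), rng.uniform(0, 1000, 300)])
y_demo = np.sin(6 * X_demo[:, 0]) + 0.001 * X_demo[:, 1] + 0.1 * rng.randn(300)

X_fit, y_fit = X_demo[:200], y_demo[:200]
X_new = X_demo[200:]

# Same tree settings, with and without scaling
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_fit, y_fit)
tree_scaled = make_pipeline(
    StandardScaler(),
    DecisionTreeRegressor(max_depth=4, random_state=0),
).fit(X_fit, y_fit)

pred_raw = tree_raw.predict(X_new)
pred_scaled = tree_scaled.predict(X_new)

# The thresholds differ (they live in different units), but the induced
# partition of the points, and hence the predictions, should coincide.
same_predictions = np.allclose(pred_raw, pred_scaled)
print(same_predictions)
```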
Q8¶
# Linear Regression model with and without StandardScaler, with and without intercept
from sklearn.linear_model import LinearRegression
linreg_no_scaler = Pipeline([
('regressor', LinearRegression(fit_intercept=True))
])
linreg_with_scaler = Pipeline([
('scaler', StandardScaler()),
('regressor', LinearRegression(fit_intercept=True))
])
linreg_no_scaler2 = Pipeline([
('regressor', LinearRegression(fit_intercept=False))
])
linreg_with_scaler2 = Pipeline([
('scaler', StandardScaler()),
('regressor', LinearRegression(fit_intercept=False))
])
linreg_no_scaler.fit(X_train, y_train)
linreg_with_scaler.fit(X_train, y_train)
linreg_no_scaler2.fit(X_train, y_train)
linreg_with_scaler2.fit(X_train, y_train)
y_pred_no_scaler = linreg_no_scaler.predict(X_test)
y_pred_with_scaler = linreg_with_scaler.predict(X_test)
y_pred_no_scaler2 = linreg_no_scaler2.predict(X_test)
y_pred_with_scaler2 = linreg_with_scaler2.predict(X_test)
rmse_no_scaler = root_mean_squared_error(y_test, y_pred_no_scaler)
rmse_with_scaler = root_mean_squared_error(y_test, y_pred_with_scaler)
rmse_no_scaler2 = root_mean_squared_error(y_test, y_pred_no_scaler2)
rmse_with_scaler2 = root_mean_squared_error(y_test, y_pred_with_scaler2)
print(f"RMSE without StandardScaler: {rmse_no_scaler:.4f}")
print(f"RMSE with StandardScaler: {rmse_with_scaler:.4f}")
print(f"RMSE without StandardScaler & no Intercept : {rmse_no_scaler2:.4f}")
print(f"RMSE with StandardScaler & no Intercept: {rmse_with_scaler2:.4f}")
RMSE without StandardScaler: 0.8676
RMSE with StandardScaler: 0.8676
RMSE without StandardScaler & no Intercept : 0.8728
RMSE with StandardScaler & no Intercept: 11.0218
fit_intercept=True (default): includes an intercept term, which usually helps fit the data more accurately.
fit_intercept=False: excludes the intercept, forcing the model to pass through the origin, which is useful in specific cases where the data is already centered or domain knowledge dictates this relationship.
Here, scaling has no visible effect on the model error when the model fits an intercept (0.8676 in both cases). Out of curiosity, we also tested without an intercept and saw a substantial difference: 0.8728 without scaling versus 11.0218 with scaling. This is because x=0 does not correspond to y=0 in our data: after standardization every feature has zero mean, so a model without intercept predicts values centered on 0, while log_PAX takes values around 11. The error is then dominated by this constant offset.
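The effect of dropping the intercept after standardization can be reproduced on synthetic data. This is a minimal sketch under our own assumptions (Gaussian features, a target centered around 11.5 to mimic the order of magnitude of log_PAX); `fit_rmse` is a helper introduced for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(500, 3))
# Synthetic target whose mean is far from 0 (assumption made for illustration)
y_demo = 11.0 + 0.3 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + 0.1 * rng.randn(500)

def fit_rmse(fit_intercept):
    """Training RMSE of StandardScaler + LinearRegression."""
    model = make_pipeline(StandardScaler(), LinearRegression(fit_intercept=fit_intercept))
    model.fit(X_demo, y_demo)
    residuals = y_demo - model.predict(X_demo)
    return float(np.sqrt(np.mean(residuals ** 2)))

rmse_with_intercept = fit_rmse(True)
rmse_without_intercept = fit_rmse(False)

# Without intercept, predictions on centered features average to 0,
# so the RMSE cannot be smaller than |mean(y)|.
print(rmse_with_intercept, rmse_without_intercept, abs(y_demo.mean()))
```

The lower bound follows from the decomposition of the mean squared residual into the variance of the residuals plus the square of their mean; with centered features and no intercept, that mean is the mean of the target.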
Q9¶
from sklearn.preprocessing import OneHotEncoder
categorical_features = df.select_dtypes(include=['object']).columns
categorical_data = df[categorical_features]
print(categorical_features)
encoder = OneHotEncoder()
# Transform the data
encoded_data = encoder.fit_transform(categorical_data)
categories = encoder.get_feature_names_out(categorical_features)
print(categories)
df_encoded_1 = df.drop(columns=categorical_features)
df_encoded = pd.concat([df_encoded_1, pd.DataFrame(encoded_data.toarray(), columns=categories)], axis=1)
df_encoded.head()
Index(['Departure', 'Arrival'], dtype='object')
['Departure_ATL' 'Departure_BOS' 'Departure_CLT' 'Departure_DEN' 'Departure_DFW' 'Departure_DTW' 'Departure_EWR' 'Departure_IAH' 'Departure_JFK' 'Departure_LAS' 'Departure_LAX' 'Departure_LGA' 'Departure_MCO' 'Departure_MIA' 'Departure_MSP' 'Departure_ORD' 'Departure_PHL' 'Departure_PHX' 'Departure_SEA' 'Departure_SFO' 'Arrival_ATL' 'Arrival_BOS' 'Arrival_CLT' 'Arrival_DEN' 'Arrival_DFW' 'Arrival_DTW' 'Arrival_EWR' 'Arrival_IAH' 'Arrival_JFK' 'Arrival_LAS' 'Arrival_LAX' 'Arrival_LGA' 'Arrival_MCO' 'Arrival_MIA' 'Arrival_MSP' 'Arrival_ORD' 'Arrival_PHL' 'Arrival_PHX' 'Arrival_SEA' 'Arrival_SFO']
| DateOfDeparture | WeeksToDeparture | log_PAX | std_wtd | day | weekday | week | month | year | holiday | ... | Arrival_LAX | Arrival_LGA | Arrival_MCO | Arrival_MIA | Arrival_MSP | Arrival_ORD | Arrival_PHL | Arrival_PHX | Arrival_SEA | Arrival_SFO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-06-19 | 12.875000 | 12.331296 | 9.812647 | 19 | 2 | 25 | 6 | 2012 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2012-09-10 | 14.285714 | 10.775182 | 9.466734 | 10 | 1 | 37 | 9 | 2012 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2012-10-05 | 10.863636 | 11.083177 | 9.035883 | 5 | 5 | 40 | 10 | 2012 | 0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2011-10-09 | 11.480000 | 11.169268 | 7.990202 | 9 | 7 | 40 | 10 | 2011 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2012-02-21 | 11.450000 | 11.269364 | 9.517159 | 21 | 2 | 8 | 2 | 2012 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 51 columns
Q10¶
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
numerical_features = df.select_dtypes(include=['int', 'float']).columns.tolist()
numerical_features = [col for col in numerical_features if col != 'log_PAX'] # as log_PAX is not a feature
preprocessor = ColumnTransformer(
transformers=[
('numerical', StandardScaler(), numerical_features),
('categorical', OneHotEncoder(), categorical_features)
]
)
random_forest = Pipeline([
('preprocessor', preprocessor),
('regressor', RandomForestRegressor())
])
X = df.drop(columns=['log_PAX'])
y = df['log_PAX']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4839)
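# Fit on the training split only: fitting on all of X lets the model see the test rows and makes the RMSE optimistic
# (the value printed below was obtained with a fit on the full X)
random_forest.fit(X_train, y_train)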
y_pred = random_forest.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse}")
RMSE: 0.1763120567154462
Q11¶
from sklearn.model_selection import GridSearchCV
import time
param_grid = {
'regressor__n_estimators': [10, 50, 100], # Number of trees
'regressor__max_depth': [None, 5, 10, 20] # Maximum depth of each tree
}
# Set up the grid search pipeline
grid_search = GridSearchCV(
random_forest, # defined on the above cell
param_grid,
cv=5, # 5-fold cross-validation
scoring='neg_mean_squared_error',
)
# Perform grid search
start_time = time.time()
grid_search.fit(X_train, y_train)
time_taken = time.time() - start_time
# Print the best parameters and corresponding score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Score (Negative MSE):", abs(grid_search.best_score_))
print("Execution time:", time_taken, "seconds")
# Test the best model on the test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print("Test Set R^2 Score:", test_score)
Best Parameters: {'regressor__max_depth': None, 'regressor__n_estimators': 100}
Best Cross-Validation Score (Negative MSE): 0.26537361102959967
Execution time: 85.70539402961731 seconds
Test Set R^2 Score: 0.7426286140085898
PS: the score is negative because of the way sklearn works: it always maximizes a score, so it maximizes the negative MSE, which is equivalent to minimizing the MSE. The actual MSE is simply the returned number with its sign flipped, which is why I displayed its absolute value.
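To compare the cross-validation score with the RMSE values used in the earlier questions, the sign can be flipped and the square root taken. The short sketch below does this for the value reported in the run above; the variable names are ours.

```python
import math

# grid_search.best_score_ as returned by sklearn for the run above
# (the notebook printed its absolute value)
neg_mse = -0.26537361102959967

cv_mse = -neg_mse             # flip the sign to recover the MSE
cv_rmse = math.sqrt(cv_mse)   # back to the unit of log_PAX

print(f"CV MSE: {cv_mse:.4f}, CV RMSE: {cv_rmse:.4f}")
```

scikit-learn also accepts `scoring='neg_root_mean_squared_error'`, which would report the (negated) RMSE directly.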
Our grid search selects max_depth=None and n_estimators=100. On the cross-validation folds, this setup scores better than the other combinations we tested, for example 50 trees of depth 5 or 100 trees of depth 10. We can conclude that the model performed best when the trees in the Random Forest regressor were allowed to grow to their maximum possible depth, rather than restricting their depth. Individually, such fully grown trees tend to overfit the training set; the forest compensates by averaging them, which is consistent with the unrestricted depth winning on held-out folds.
Regarding execution time, grid search can be computationally expensive because it evaluates every combination of hyperparameters using cross-validation. The execution time depends on the size of the dataset, but also on the model complexity, the number of hyperparameter combinations and the number k of cross-validation folds: here 3 × 4 = 12 combinations × 5 folds = 60 fits, plus one final refit of the best combination on the whole training set.
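The number of model fits behind a grid search can be counted before launching it. The sketch below uses `ParameterGrid` from scikit-learn with the same grid as above; `n_folds` and the other variable names are ours.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "regressor__n_estimators": [10, 50, 100],
    "regressor__max_depth": [None, 5, 10, 20],
}
n_folds = 5

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 hyperparameter combinations
n_cv_fits = n_candidates * n_folds             # one fit per combination and per fold
n_total_fits = n_cv_fits + 1                   # plus the final refit (refit=True is the default)

print(n_candidates, n_cv_fits, n_total_fits)
```

Adding a third hyperparameter with, say, 3 values would triple these counts, which is why the grid size matters more than any single setting.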
For $R^2$, 0.7426 means that when applying Random Forest to our test dataset with the optimal parameters returned by the grid search, we get a model that explains approx. 74.26% of the variance, which from our point of view shows that the model fits the data reasonably well, but there is room for improvement.
Q12¶
import joblib
# Save the best pipeline (full pipeline with preprocessing and model)
joblib.dump(grid_search.best_estimator_, 'best_pipeline.pkl')
# Save only the trained RandomForest model
joblib.dump(best_model.named_steps['regressor'], 'best_model.pkl')
# Load from disk both the full pipeline from disk & the best RF model
loaded_pipeline = joblib.load('best_pipeline.pkl')
loaded_model = joblib.load('best_model.pkl')
# Test the loaded pipeline on the test set & confirm that the loaded pipeline's score matches the original
print(f"Loaded Pipeline Test Set R^2 Score: {loaded_pipeline.score(X_test, y_test)}")
assert abs(loaded_pipeline.score(X_test, y_test) - test_score) < 1e-5, "Loaded pipeline score doesn't match!"
Loaded Pipeline Test Set R^2 Score: 0.7426286140085898
As we can see, the $R^2$ is unchanged, which confirms that the pipeline loaded from disk is the same as the one we tuned in the previous question.
It is sufficient to save only grid_search.best_estimator_, as it contains the entire pipeline, from preprocessing steps like scaling and encoding to the trained RF model. In contrast, joblib.dump(best_model.named_steps['regressor'], 'best_model.pkl') only saves the trained RF model, which cannot be applied to raw data on its own. More generally, joblib.dump(best_model.named_steps, 'best_model.pkl') includes both the preprocessing steps and the model, but as separate objects, so it's not directly usable on raw data.
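The difference between saving the whole pipeline and saving only the final estimator can be seen on a toy example. The sketch below is built on our own assumptions (a four-row made-up dataset, a decision tree instead of a random forest, temporary file names); it is not the pipeline from the previous question.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up dataset with one categorical and one numerical column
raw = pd.DataFrame({
    "airport": ["ORD", "DFW", "ORD", "LAS"],
    "weeks": [12.0, 14.0, 10.0, 11.0],
})
target = [12.3, 10.8, 11.1, 11.2]

pipe = Pipeline([
    ("preprocessor", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["airport"])],
        remainder="passthrough",
    )),
    ("regressor", DecisionTreeRegressor(random_state=0)),
]).fit(raw, target)

with tempfile.TemporaryDirectory() as tmp:
    pipe_path = os.path.join(tmp, "pipeline.pkl")
    model_path = os.path.join(tmp, "model.pkl")
    joblib.dump(pipe, pipe_path)                            # preprocessing + model
    joblib.dump(pipe.named_steps["regressor"], model_path)  # model only
    loaded_pipe = joblib.load(pipe_path)
    loaded_model = joblib.load(model_path)

# The full pipeline accepts raw data and encodes it itself
pipeline_ok = len(loaded_pipe.predict(raw)) == len(raw)

# The bare model expects already-encoded features, so raw strings are rejected
try:
    loaded_model.predict(raw)
    bare_model_ok = True
except Exception:
    bare_model_ok = False

print(pipeline_ok, bare_model_ok)
```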
Q13¶
K-NN requires no actual training: fitting a K-NN model amounts to storing the training set, which costs $O(1)$ in computation if only a reference to the data is kept (or $O(n \cdot d)$ if the data are copied).
The cost of predicting for one new point involves:
- Computing the distances between this new point and the $n$ points in the dataset: $O(n \cdot d)$.
- Finding the K (here 1) smallest distances. Sorting all distances would cost $O(n \log n)$; keeping only the K smallest in a heap costs $O(n \log K)$.
- Identifying the most common class among the K nearest neighbors: $O(K)$.
Thus, this amounts to a total computational cost of $O(n \cdot d + n \log K)$ to predict a new observation, which reduces to $O(n \cdot d)$ for $K = 1$.
As we have seen, predicting for a single new point becomes computationally expensive as the training set size $n$ grows, making basic K-NN less suitable for very large datasets without optimizations. It basically shifts the burden from the training phase (no training required whatsoever) to the prediction phase, which has to compute all $n$ distances, find the K smallest ones, and then either return the label of the single nearest neighbor (K = 1) or aggregate the classes of the K nearest neighbors to decide the class of the new point.
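The prediction steps for one point can be sketched as follows. This is an illustrative brute-force version under our own assumptions: the function name `knn_predict_one` and the toy data are made up, and `np.argpartition` is used as one possible way to select the K smallest distances without sorting all of them.

```python
import numpy as np

def knn_predict_one(X_train, y_train, x_new, k=1):
    """Brute-force K-NN prediction for a single point."""
    # n distance computations in dimension d: O(n * d).
    # Squared distances are enough, since they preserve the ordering.
    sq_dist = np.sum((X_train - x_new) ** 2, axis=1)
    # Select the k smallest distances without a full sort
    nearest = np.argpartition(sq_dist, k - 1)[:k]
    # Majority vote among the k neighbors: O(k)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Two small clusters: class 0 near the origin, class 1 near (1, 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.0]])
y_toy = np.array([0, 0, 1, 1, 1])

pred_near_origin = knn_predict_one(X_toy, y_toy, np.array([0.05, 0.0]), k=1)
pred_near_ones = knn_predict_one(X_toy, y_toy, np.array([1.0, 0.9]), k=3)
print(pred_near_origin, pred_near_ones)
```

Every call scans the whole training set, which is the prediction-time cost discussed above; tree-based indexes such as the ones offered by scikit-learn's `KNeighborsClassifier` are a common way to reduce it.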
Q14¶
from scipy.stats import mode
from sklearn.base import BaseEstimator, ClassifierMixin
class KNearestNeighbor(BaseEstimator, ClassifierMixin):
    def __init__(self, k):
        self.k = k
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        # Store the data passed to fit (not the global X_train / y_train),
        # so that cross-validation trains on the right folds
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)
        self.classes_ = np.unique(y) # store class labels so that cross_val_score can retrieve classes_ later on
        return self # sklearn convention

    def predict(self, X_test):
        predictions = []
        for x in np.asarray(X_test):
            # Compute distances between x and all examples in the training set
            distances = np.sqrt(np.sum((self.X_train - x)**2, axis=1))
            kn_indices = np.argsort(distances)[:self.k] # Indices of k nearest neighbors
            kn_labels = self.y_train[kn_indices] # Labels of k nearest neighbors
            # Predict the majority class
            prediction = mode(kn_labels, keepdims=True).mode[0] # keepdims is to avoid dimension mismatch with input
            predictions.append(prediction)
        return np.array(predictions)
Q15¶
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import check_random_state
def rand_checkers(n_samples=2000, sigma=0.1, random_state=0):
    rng = check_random_state(random_state)
    nbp = n_samples // 16
    nbn = n_samples // 16
    xapp = rng.rand((nbp + nbn) * 16).reshape((nbp + nbn) * 8, 2)
    yapp = np.ones((nbp + nbn) * 8)
    idx = 0
    for i in range(-2, 2):
        for j in range(-2, 2):
            if ((i + j) % 2) == 0:
                nb = nbp
            else:
                nb = nbn
                yapp[idx:(idx + nb)] = [(i + j) % 3 + 1] * nb
            xapp[idx:(idx + nb), 0] = rng.rand(nb)
            xapp[idx:(idx + nb), 0] += i + sigma * rng.randn(nb)
            xapp[idx:(idx + nb), 1] = rng.rand(nb)
            xapp[idx:(idx + nb), 1] += j + sigma * rng.randn(nb)
            idx += nb
    ind = np.arange(xapp.shape[0])
    rng.shuffle(ind)
    res = np.hstack([xapp, yapp[:, np.newaxis]])
    return np.array(res[ind, :2]), np.array(res[ind, 2])
X, y = rand_checkers()
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='Accent', edgecolor='k', s=50)
plt.title('rand_checkers Dataset')
plt.grid()
plt.legend(handles=scatter.legend_elements()[0], labels=['Class 1', 'Class 2', 'Class 3'])
plt.show()
We can see a 2D dataset where each point, defined by 2 features, belongs to one of 3 classes, represented by blue, green, and grey points. The points are arranged in a checkerboard pattern, with alternating regions dominated by one class. This creates a structured, non-linear distribution with clear but irregular class boundaries.
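Overall, points are distributed uniformly within each square. In regions where the classes meet, points overlap, which is expected: rand_checkers adds Gaussian noise (sigma=0.1) to the coordinates, so some points spill over into the neighboring squares.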
Q16¶
from sklearn.model_selection import cross_val_score, train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
k_values = range(1, 21)
train_scores = []
test_scores = []
for k in k_values:
    knn = KNearestNeighbor(k=k)
    # 10-fold cross-validation accuracies (computed for reference, not plotted)
    cv_scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))
plt.figure(figsize=(8, 6))
plt.plot(k_values, train_scores, label='Train Score', marker='o')
plt.plot(k_values, test_scores, label='Test Score', marker='s')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('KNN Performance vs. Number of Neighbors')
plt.legend()
plt.grid()
plt.show()
We see that for $k=1$ the training accuracy is 1: each training point is its own nearest neighbor, so the model perfectly fits the training data (no generalization), which indicates overfitting. As k increases, the training accuracy decreases because more neighbors are included in the decision, leading to more generalized boundaries rather than a perfect fit of the training data, but it stabilizes for larger k values.
The testing accuracy is lower and fluctuates more significantly. This is because small values of k lead to overfitting on the training data, causing the model to perform poorly on the test set (low generalization). The testing accuracy improves initially as the model starts generalizing better, reducing overfitting. However, if k becomes too large, the model underfits because it averages over too many neighbors, including those that are far away and less relevant to the target data point.
Exercise 2: Encoding and Hyperparameter Tuning with Optuna¶
Q17¶
adult_census = pd.read_csv("https://www.openml.org/data/get_csv/1595261/adult-census.csv")
adult_census.dropna(inplace=True)
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
X = adult_census.drop(columns=['class'])
y = adult_census['class']
categorical_columns = X.select_dtypes(include=['object']).columns
numerical_columns = X.select_dtypes(exclude=['object']).columns
# Define train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Define models
models = {
'LogisticRegression': LogisticRegression(max_iter=5000, random_state=42),
'RandomForest': RandomForestClassifier(random_state=42),
'HistGradientBoosting': HistGradientBoostingClassifier(random_state=42)
}
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, FunctionTransformer
from scipy.sparse import csr_matrix
dense_transformer = FunctionTransformer(lambda x: x.toarray() if isinstance(x, csr_matrix) else x)
# OneHot Encoding
one_hot_pipeline = Pipeline(steps=[
('preprocessor', ColumnTransformer(
transformers=[
('num', Pipeline(steps=[
('scale', StandardScaler()),
]), numerical_columns),
('cat', Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore')),
('to_dense', dense_transformer)
]), categorical_columns)
])),
])
# Ordinal Encoding
ordinal_pipeline = Pipeline(steps=[
('preprocessor', ColumnTransformer(
transformers=[
('num', Pipeline(steps=[
('scale', StandardScaler())
]), numerical_columns),
('cat', Pipeline(steps=[
('ordinal', OrdinalEncoder())
]), categorical_columns)
])),
])
from sklearn.metrics import accuracy_score
# Function to train and evaluate models
def evaluate_models(pipeline, encoding_type, X_train, X_test, y_train, y_test):
    results = {}
    for name, model in models.items():
        full_pipeline = Pipeline(steps=[
            ('preprocessing', pipeline),
            ('model', model)
        ])
        full_pipeline.fit(X_train, y_train)
        y_pred = full_pipeline.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        results[f'{name} ({encoding_type})'] = accuracy
    return results
# Evaluate both encoding strategies
one_hot_results = evaluate_models(one_hot_pipeline, "OneHot Encoding", X_train, X_test, y_train, y_test)
ordinal_results = evaluate_models(ordinal_pipeline, "Ordinal Encoding", X_train, X_test, y_train, y_test)
# Combine results
results = {**one_hot_results, **ordinal_results}
for model, accuracy in results.items():
    print(f"{model}: {accuracy:.4f}")
LogisticRegression (OneHot Encoding): 0.8550
RandomForest (OneHot Encoding): 0.8576
HistGradientBoosting (OneHot Encoding): 0.8765
LogisticRegression (Ordinal Encoding): 0.8277
RandomForest (Ordinal Encoding): 0.8605
HistGradientBoosting (Ordinal Encoding): 0.8772
Logistic regression performs better with one-hot encoding (0.8550) compared to ordinal encoding (0.8277). This is because one-hot encoding effectively handles categorical variables without introducing an artificial ordinal relationship. In contrast, ordinal encoding imposes an order on categorical data, which a linear model interprets as a numerical trend and which can negatively impact its performance.
For random forest, the performance is nearly the same with both encoding methods, although ordinal encoding achieves slightly higher accuracy (0.8605) than one-hot encoding (0.8576). Ordinal encoding reduces the number of features compared to one-hot encoding, which may make the model less complex and less prone to overfitting.
Similarly, HistGradientBoosting performs almost equally well with both encoding strategies, with ordinal encoding showing a slight advantage (0.8772) over one-hot encoding (0.8765). The difference, however, is minimal. Gradient boosting models are inherently efficient at handling numerical and ordinal features, making them less sensitive to the choice of encoding.
In conclusion, ordinal encoding is more suitable for tree-based models like random forest and HistGradientBoosting, while one-hot encoding is better suited for linear models such as logistic regression, which significantly benefits from it. Among the models evaluated, HistGradientBoosting achieves the highest accuracy with both encoding strategies, which makes it the best of the three models on this train/test split.
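The artificial order introduced by ordinal encoding is easy to see on a toy column. The sketch below uses a made-up `colors` array (not a column of the census data) and the two scikit-learn encoders with their default settings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One categorical column with three unordered categories (made-up example)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordinal encoding: one column of integer codes, categories sorted alphabetically
ordinal_codes = OrdinalEncoder().fit_transform(colors).ravel()

# One-hot encoding: one binary column per category, no order implied
one_hot = OneHotEncoder().fit_transform(colors).toarray()

print(ordinal_codes)   # codes follow the order blue < green < red
print(one_hot.shape)   # one row per sample, one column per category
```

A linear model reads the ordinal codes as "red is two units above blue", which has no meaning for colors. A tree can still isolate a category with one or two threshold splits, which is one reason why tree-based models cope better with this encoding.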
Q18¶
import optuna
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
# Define the objective function for Optuna
def objective(trial):
    # Suggest a learning rate on a log scale
    # (suggest_float(..., log=True) is the non-deprecated form of suggest_loguniform)
    learning_rate = trial.suggest_float("learning_rate", 0.01, 0.5, log=True)
    # Define the model with the suggested learning rate
    model = HistGradientBoostingClassifier(learning_rate=learning_rate, random_state=42)
    pipeline = Pipeline(steps=[
        ('preprocessing', ordinal_pipeline),
        ('model', model)
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
# Create an Optuna study and optimize
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50) # Perform 50 trials
# Finding the optimal learning rate and corresponding accuracy
best_learning_rate = study.best_params["learning_rate"]
best_accuracy = study.best_value
# Train the best model with the optimal learning rate
optimized_model = HistGradientBoostingClassifier(learning_rate=best_learning_rate, random_state=42)
optimized_pipeline = Pipeline(steps=[
('preprocessing', ordinal_pipeline),
('model', optimized_model)
])
optimized_pipeline.fit(X_train, y_train)
optimized_y_pred = optimized_pipeline.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_y_pred)
print(f"Optimized HistGradientBoostingClassifier Accuracy: {optimized_accuracy:.4f} (Learning Rate: {best_learning_rate:.4f})")
[I 2025-03-11 14:49:14,415] A new study created in memory with name: no-name-05ead8e4-38a9-418d-b064-87c24cc23c5d
/var/folders/1q/4p_js9jn14x3_527d5fmd8dr0000gn/T/ipykernel_91235/4092926628.py:9: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
learning_rate = trial.suggest_loguniform("learning_rate", 0.01, 0.5)
[I 2025-03-11 14:49:15,302] Trial 0 finished with value: 0.8667849587115266 and parameters: {'learning_rate': 0.020133696922712555}. Best is trial 0 with value: 0.8667849587115266.
[I 2025-03-11 14:49:15,701] Trial 1 finished with value: 0.8747014263290793 and parameters: {'learning_rate': 0.2708807763388238}. Best is trial 1 with value: 0.8747014263290793.
[I 2025-03-11 14:49:15,962] Trial 2 finished with value: 0.8742919538661025 and parameters: {'learning_rate': 0.31441311988132675}. Best is trial 1 with value: 0.8747014263290793.
[I 2025-03-11 14:49:16,399] Trial 3 finished with value: 0.8751791442025524 and parameters: {'learning_rate': 0.18434679344865298}. Best is trial 3 with value: 0.8751791442025524.
[I 2025-03-11 14:49:16,872] Trial 4 finished with value: 0.8777042243909097 and parameters: {'learning_rate': 0.08106175501850842}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:17,028] Trial 5 finished with value: 0.874018972224118 and parameters: {'learning_rate': 0.4325970794285743}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:17,745] Trial 6 finished with value: 0.872176346140722 and parameters: {'learning_rate': 0.03052806399733684}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:18,095] Trial 7 finished with value: 0.8737459905821333 and parameters: {'learning_rate': 0.27287949134520073}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:18,786] Trial 8 finished with value: 0.8769535248754521 and parameters: {'learning_rate': 0.07942480838819871}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:19,406] Trial 9 finished with value: 0.8755203712550331 and parameters: {'learning_rate': 0.04613489945341571}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:20,057] Trial 10 finished with value: 0.8774994881594212 and parameters: {'learning_rate': 0.10257039674357407}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:20,611] Trial 11 finished with value: 0.8774312427489251 and parameters: {'learning_rate': 0.0930757914371597}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:21,306] Trial 12 finished with value: 0.8761345799494984 and parameters: {'learning_rate': 0.12770393351392245}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:21,871] Trial 13 finished with value: 0.8615300621033235 and parameters: {'learning_rate': 0.01198925260525477}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:22,381] Trial 14 finished with value: 0.8766122978229715 and parameters: {'learning_rate': 0.04962617868360298}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:22,824] Trial 15 finished with value: 0.8763393161809868 and parameters: {'learning_rate': 0.14212138300257324}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:23,353] Trial 16 finished with value: 0.8765440524124752 and parameters: {'learning_rate': 0.05869283045425209}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:23,887] Trial 17 finished with value: 0.8766805432334676 and parameters: {'learning_rate': 0.09102538426335664}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:24,477] Trial 18 finished with value: 0.8734730089401488 and parameters: {'learning_rate': 0.03427696018647037}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:25,000] Trial 19 finished with value: 0.8762710707704907 and parameters: {'learning_rate': 0.16090074792430303}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:25,611] Trial 20 finished with value: 0.8684910939739302 and parameters: {'learning_rate': 0.022671353921580036}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:26,308] Trial 21 finished with value: 0.8767487886439637 and parameters: {'learning_rate': 0.09872014212819964}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:27,018] Trial 22 finished with value: 0.8774312427489251 and parameters: {'learning_rate': 0.07356822012197047}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:27,807] Trial 23 finished with value: 0.876885279464956 and parameters: {'learning_rate': 0.1141271278093714}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:28,265] Trial 24 finished with value: 0.8754521258445369 and parameters: {'learning_rate': 0.21342599405823462}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:28,856] Trial 25 finished with value: 0.8764758070019791 and parameters: {'learning_rate': 0.062009518914653505}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:29,491] Trial 26 finished with value: 0.8752473896130485 and parameters: {'learning_rate': 0.03985909725554221}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:30,192] Trial 27 finished with value: 0.8763393161809868 and parameters: {'learning_rate': 0.1047880961809587}. Best is trial 4 with value: 0.8777042243909097.
[I 2025-03-11 14:49:30,790] Trial 28 finished with value: 0.8777724698014059 and parameters: {'learning_rate': 0.07036335692509024}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:31,561] Trial 29 finished with value: 0.8699242475943493 and parameters: {'learning_rate': 0.02445184460679821}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:32,332] Trial 30 finished with value: 0.8552514843376783 and parameters: {'learning_rate': 0.010043465446904574}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:32,992] Trial 31 finished with value: 0.8767487886439637 and parameters: {'learning_rate': 0.06617873892979567}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:33,651] Trial 32 finished with value: 0.8769535248754521 and parameters: {'learning_rate': 0.08404561138257535}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:34,384] Trial 33 finished with value: 0.8759980891285061 and parameters: {'learning_rate': 0.04929389384698084}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:34,742] Trial 34 finished with value: 0.8766122978229715 and parameters: {'learning_rate': 0.20724095646946064}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:35,341] Trial 35 finished with value: 0.8770900156964444 and parameters: {'learning_rate': 0.14684903722369996}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:35,863] Trial 36 finished with value: 0.8770900156964444 and parameters: {'learning_rate': 0.11625154230619308}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:36,075] Trial 37 finished with value: 0.8711526649832799 and parameters: {'learning_rate': 0.4293794067109366}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:36,646] Trial 38 finished with value: 0.8738824814031256 and parameters: {'learning_rate': 0.1815151470488454}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:37,129] Trial 39 finished with value: 0.8769535248754521 and parameters: {'learning_rate': 0.07765494290105862}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:37,606] Trial 40 finished with value: 0.8768170340544599 and parameters: {'learning_rate': 0.053988863352311095}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:38,269] Trial 41 finished with value: 0.877362997338429 and parameters: {'learning_rate': 0.07609271693548464}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:38,893] Trial 42 finished with value: 0.876885279464956 and parameters: {'learning_rate': 0.0680411008787863}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:39,528] Trial 43 finished with value: 0.8760663345390023 and parameters: {'learning_rate': 0.09440951830699847}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:40,151] Trial 44 finished with value: 0.8755203712550331 and parameters: {'learning_rate': 0.04311505351323047}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:40,729] Trial 45 finished with value: 0.8758615983075139 and parameters: {'learning_rate': 0.13052978063911683}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:41,348] Trial 46 finished with value: 0.8748379171500716 and parameters: {'learning_rate': 0.03822473880687629}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:41,898] Trial 47 finished with value: 0.8770900156964444 and parameters: {'learning_rate': 0.058865579927629995}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:42,529] Trial 48 finished with value: 0.871698628267249 and parameters: {'learning_rate': 0.02955020790607107}. Best is trial 28 with value: 0.8777724698014059.
[I 2025-03-11 14:49:42,999] Trial 49 finished with value: 0.8766805432334676 and parameters: {'learning_rate': 0.11371691555828067}. Best is trial 28 with value: 0.8777724698014059.
Optimized HistGradientBoostingClassifier Accuracy: 0.8778 (Learning Rate: 0.0704)
The HistGradientBoosting classifier achieves the highest accuracy among the three models, with an accuracy of 0.8778. The best learning rate found by Optuna is 0.0704 (trial 28). Tuning matters mostly at the low end of the range: learning rates close to 0.01 give the lowest accuracy (about 0.855 to 0.862), while values between roughly 0.07 and 0.10 give the best scores.
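The FutureWarning in the log recommends replacing `trial.suggest_loguniform(...)` with `trial.suggest_float(..., log=True)`. To illustrate what sampling on a log scale means for the range [0.01, 0.5] used here, the sketch below draws values uniformly in log space using only the standard library. The helper name `sample_log_uniform` is hypothetical, and this is not Optuna's sampler, which also takes previous trials into account.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(0.01, 0.5, rng) for _ in range(10_000)]

# On a log scale, [0.01, 0.5] is split evenly at its geometric mean,
# so about half of the draws fall below sqrt(0.01 * 0.5).
geometric_mid = math.sqrt(0.01 * 0.5)
share_below = sum(s < geometric_mid for s in samples) / len(samples)
print(f"geometric midpoint: {geometric_mid:.4f}, share below: {share_below:.2f}")

# In the objective itself, the non-deprecated Optuna call would be:
# learning_rate = trial.suggest_float("learning_rate", 0.01, 0.5, log=True)
```

A log scale spends as many draws between 0.01 and 0.07 as between 0.07 and 0.5, which suits a parameter like the learning rate whose effect is multiplicative.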
Exercise 3: Processing Fuzzy Categorical Data¶
Q19¶
dfX = pd.read_csv("/Users/solalzana/Desktop/DSB PDS/Lab3 - sklearn & hyperparams tuning/data/skrub_data_lab3/salary_X.csv")
dfY = pd.read_csv("/Users/solalzana/Desktop/DSB PDS/Lab3 - sklearn & hyperparams tuning/data/skrub_data_lab3/salary_Y.csv")
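# Note: set_index returns a new DataFrame; the results are not assigned here,
# so dfX and dfY keep their 'Unnamed: 0' column in the cells below
dfX.set_index('Unnamed: 0', drop=True)
dfY.set_index('Unnamed: 0', drop=True)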
| current_annual_salary | |
|---|---|
| Unnamed: 0 | |
| 0 | 69222.18 |
| 1 | 97392.47 |
| 2 | 104717.28 |
| 3 | 52734.57 |
| 4 | 93396.00 |
| ... | ... |
| 9223 | 72094.53 |
| 9224 | 169543.85 |
| 9225 | 102736.52 |
| 9226 | 153747.50 |
| 9227 | 75484.08 |
9228 rows × 1 columns
dfX.head()
# dfX.dtypes
| Unnamed: 0 | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records... | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986 |
| 1 | 1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988 |
| 2 | 2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1989 |
| 3 | 3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 05/05/2014 | 2014 |
| 4 | 4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 03/05/2007 | 2007 |
This dataset describes employees and their jobs (hiring date, department name, position title, etc.), together with their corresponding annual salaries.
The goal here seems to be to predict the current_annual_salary (target variable) based on employee characteristics (features in X).
Q20¶
distinct_modalities = dfX.nunique()
distinct_modalities
Unnamed: 0                 9228
gender                        2
department                   37
department_name              37
division                    694
assignment_category           2
employee_position_title     443
date_first_hired           2264
year_first_hired             51
dtype: int64
If we use a OneHotEncoder, we should be careful with the columns division and employee_position_title: with 694 and 443 distinct values respectively, they alone would add 694 + 443 = 1137 columns to the dataset!
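As a quick check of this arithmetic, the sketch below adds up the distinct-value counts reported by `dfX.nunique()` for the six text columns. The cutoff of 40 used to flag high-cardinality columns is an illustrative choice made for this sketch, not a value taken from the lab.

```python
# Distinct-value counts copied from the dfX.nunique() output.
n_unique = {
    "gender": 2,
    "department": 37,
    "department_name": 37,
    "division": 694,
    "assignment_category": 2,
    "employee_position_title": 443,
}

# One-hot encoding creates one column per distinct value (no category dropped).
one_hot_width = sum(n_unique.values())

# Illustrative cutoff (an assumption): columns above it are "high cardinality".
CUTOFF = 40
high_cardinality = {col: n for col, n in n_unique.items() if n > CUTOFF}

print("one-hot columns for the six text features:", one_hot_width)
print("high-cardinality columns:", sorted(high_cardinality))
print("columns they contribute:", sum(high_cardinality.values()))
```

Almost all of the one-hot width comes from the two high-cardinality columns, which is why they are the ones worth encoding differently.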
We should also be careful with date_first_hired and year_first_hired: they are stored as a string and an integer respectively (not as dates), and date_first_hired in particular has many distinct values (2264).
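One possible way to handle the date column is to parse the strings and keep a few numeric parts instead of thousands of distinct labels. The sketch below uses only the standard library on the five values visible in `dfX.head()`; the choice of extracted parts is an assumption for illustration.

```python
from datetime import datetime

# Sample values copied from the date_first_hired column shown in dfX.head().
raw_dates = ["09/22/1986", "09/12/1988", "11/19/1989", "05/05/2014", "03/05/2007"]

# The strings are in month/day/year order, so the format is given explicitly.
parsed = [datetime.strptime(value, "%m/%d/%Y") for value in raw_dates]

# A few numeric features replace the raw date strings.
features = [
    {"year": d.year, "month": d.month, "day": d.day, "weekday": d.isoweekday()}
    for d in parsed
]
for raw, row in zip(raw_dates, features):
    print(raw, "->", row)

# With pandas, the equivalent parsing step on the full column would be:
# pd.to_datetime(dfX["date_first_hired"], format="%m/%d/%Y")
```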
Q21¶
The modalities are unordered and numerous. However, there is a notion of distance (or hierarchy) between them: the Roman numerals after the job title seem to indicate the hierarchy level. This suggests that a GapEncoder may be a better choice, as it captures similarities between strings that the OneHotEncoder cannot: with OHE, a single differing character creates a new column, whereas the GapEncoder outputs fewer dimensions and captures text similarities.
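To make the idea of string similarity concrete, the sketch below compares job titles through the character 3-grams they share. It only illustrates n-gram overlap, which is the kind of signal a GapEncoder builds on; it is not the encoder's actual algorithm. The helper names are hypothetical, and "Planning Specialist II" is a made-up variant of a title that appears in the data.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Share of distinct n-grams that two strings have in common."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Two levels of the same job versus an unrelated job.
same_family = jaccard("Planning Specialist III", "Planning Specialist II")
different = jaccard("Planning Specialist III", "Master Police Officer")

print(f"same job family:  {same_family:.2f}")
print(f"different family: {different:.2f}")
```

A one-hot encoding would treat all three titles as equally distinct, whereas the n-gram view keeps the two "Planning Specialist" levels close together.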
Q22¶
# !pip install skrub
from skrub import GapEncoder
data = pd.Series([
"Math, optimization",
"mathematics",
"maths, ml",
"ml.maths",
"machine learning",
"physics",
"phy",
"statistical physics",
"computational phys."
])
gap_encoder = GapEncoder(n_components=2) # we ask it to learn 2 components (we'll get 2 columns in the transformed output)
gap_encoder.fit(data)
print("Transformed Data:")
print(gap_encoder.transform(data))
clean_data = pd.Series(["physics"])
dirty_data = pd.Series(["physcis"])
clean_encoded = gap_encoder.transform(clean_data)
dirty_encoded = gap_encoder.transform(dirty_data)
print("Clean Encoded:", clean_encoded)
print("Dirty Encoded:", dirty_encoded)
print("Learned Components:")
print(gap_encoder.n_components)
Transformed Data:
   statistical, computational, physics  optimization, learning, machine
0                             0.070546                        24.029454
1                             0.108293                        13.491707
2                             0.051975                        10.548025
3                             0.051977                         9.048023
4                             0.052841                        21.047159
5                             7.546280                         0.053720
6                             1.549872                         0.050128
7                            25.541454                         0.058546
8                            25.531526                         0.068474
Clean Encoded:    statistical, computational, physics  optimization, learning, machine
0                              7.54628                          0.05372
Dirty Encoded:    statistical, computational, physics  optimization, learning, machine
0                             3.549864                         0.050136
Learned Components:
2
Is the change in behavior good or bad?
The encoder captures the similarity between the clean and dirty inputs to some extent: both are assigned almost entirely to the first (physics) component, and the second component barely changes (0.054 vs 0.050). However, the first component drops from 7.55 to 3.55, which indicates that the encoder does not fully associate "physcis" with "physics".
For applications that rely on robust handling of typos or dirty data, this discrepancy might lead to issues. The behavior is still useful when a dirty value is related to a known category without matching it exactly (e.g. variations like "math" vs "mathematics").
n_components refers to the number of latent topics the GapEncoder identifies in the categorical data.
In this case, n_components=2 means the encoder has learned two latent topics.
- First topic: "statistical, computational, physics"
- Second topic: "optimization, learning, machine"
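One way to quantify the clean versus dirty comparison is to separate the direction of the two encoded vectors (which topic they point to) from their magnitude. The sketch below does this on the two rows printed in the output of Q22; the helper names are hypothetical.

```python
import math

# Encoded rows copied from the printed output:
# (physics topic activation, maths / machine-learning topic activation)
clean = (7.54628, 0.05372)    # "physics"
dirty = (3.549864, 0.050136)  # "physcis"


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def topic_shares(v):
    """Rescale activations so that they sum to 1."""
    total = sum(v)
    return tuple(x / total for x in v)


print("cosine similarity:", round(cosine(clean, dirty), 4))
print("clean shares:", topic_shares(clean))
print("dirty shares:", topic_shares(dirty))
print("magnitude ratio dirty/clean:", round(sum(dirty) / sum(clean), 2))
```

Read this way, the typo leaves the topic assignment almost untouched and mainly reduces the strength of the activation, so a downstream model that is sensitive to scale could still treat the two inputs differently.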
Q23¶
dfY
| Unnamed: 0 | current_annual_salary | |
|---|---|---|
| 0 | 0 | 69222.18 |
| 1 | 1 | 97392.47 |
| 2 | 2 | 104717.28 |
| 3 | 3 | 52734.57 |
| 4 | 4 | 93396.00 |
| ... | ... | ... |
| 9223 | 9223 | 72094.53 |
| 9224 | 9224 | 169543.85 |
| 9225 | 9225 | 102736.52 |
| 9226 | 9226 | 153747.50 |
| 9227 | 9227 | 75484.08 |
9228 rows × 2 columns
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from skrub import TableVectorizer
X = dfX.copy()
y = dfY['current_annual_salary']
pipeline = Pipeline([
('vectorizer', TableVectorizer()),
('regressor', HistGradientBoostingRegressor())
])
pipeline.fit(X, y)
fitted_table_vectorizer = pipeline.named_steps['vectorizer']
print("Attributes of fitted TableVectorizer:")
print("Columns:",fitted_table_vectorizer.get_feature_names_out())
Attributes of fitted TableVectorizer: Columns: ['Unnamed: 0' 'gender_F' 'gender_M' 'gender_nan' 'department_BOA' 'department_BOE' 'department_CAT' 'department_CCL' 'department_CEC' 'department_CEX' 'department_COR' 'department_CUS' 'department_DEP' 'department_DGS' 'department_DHS' 'department_DLC' 'department_DOT' 'department_DPS' 'department_DTS' 'department_ECM' 'department_FIN' 'department_FRS' 'department_HCA' 'department_HHS' 'department_HRC' 'department_IGR' 'department_LIB' 'department_MPB' 'department_NDA' 'department_OAG' 'department_OCP' 'department_OHR' 'department_OIG' 'department_OLO' 'department_OMB' 'department_PIO' 'department_POL' 'department_PRO' 'department_REC' 'department_SHF' 'department_ZAH' 'department_name_Board of Appeals Department' 'department_name_Board of Elections' 'department_name_Community Engagement Cluster' 'department_name_Community Use of Public Facilities' 'department_name_Correction and Rehabilitation' "department_name_County Attorney's Office" 'department_name_County Council' 'department_name_Department of Environmental Protection' 'department_name_Department of Finance' 'department_name_Department of General Services' 'department_name_Department of Health and Human Services' 'department_name_Department of Housing and Community Affairs' 'department_name_Department of Liquor Control' 'department_name_Department of Permitting Services' 'department_name_Department of Police' 'department_name_Department of Public Libraries' 'department_name_Department of Recreation' 'department_name_Department of Technology Services' 'department_name_Department of Transportation' 'department_name_Ethics Commission' 'department_name_Fire and Rescue Services' 'department_name_Merit System Protection Board Department' 'department_name_Non-Departmental Account' 'department_name_Office of Agriculture' 'department_name_Office of Consumer Protection' 'department_name_Office of Emergency Management and Homeland Security' 'department_name_Office of Human 
Resources' 'department_name_Office of Human Rights' 'department_name_Office of Intergovernmental Relations Department' 'department_name_Office of Legislative Oversight' 'department_name_Office of Management and Budget' 'department_name_Office of Procurement' 'department_name_Office of Public Information' 'department_name_Office of Zoning and Administrative Hearings' 'department_name_Office of the Inspector General' 'department_name_Offices of the County Executive' "department_name_Sheriff's Office" 'division: rockville, twinbrook, library' 'division: patrol, 6th, 4th' 'division: behavioral, health, school' 'division: station, state, estate' 'division: security, mc311, mccf' 'division: silver, spring, ride' 'division: gaithersburg, nicholson, transit' 'division: personnel, special, inspections' 'division: services, highway, service' 'division: investigative, investigations, criminal' 'division: maintenance, facilities, eligibility' 'division: technology, systems, inmate' 'division: supports, support, sports' 'division: assault, family, sexual' 'division: communications, communication, immunization' 'division: traffic, safety, alcohol' 'division: welfare, children, childhood' 'division: programs, program, domestic' 'division: medical, animal, fiscal' 'division: warehouse, liquor, solid' 'division: development, budget, budgets' 'division: district, squad, 3rd' 'division: engineering, processing, parking' 'division: automotive, delivery, avenue' 'division: officers, officer, legislative' 'division: planning, training, recruit' 'division: emergency, accounts, centers' 'division: custody, toddlers, members' 'division: administration, battalion, administrative' 'division: management, assessment, protective' 'assignment_category_Parttime-Regular' 'employee_position_title: craftsworker, advisor, urban' 'employee_position_title: librarian, library, libraries' 'employee_position_title: firefighter, rescuer, master' 'employee_position_title: officer, police, office' 
'employee_position_title: therapist, the, district' 'employee_position_title: engineer, auditor, volunteer' 'employee_position_title: operator, bus, operations' 'employee_position_title: technician, mechanic, hvac' 'employee_position_title: recreation, renovation, construction' 'employee_position_title: lieutenant, captain, chief' 'employee_position_title: manager, projects, project' 'employee_position_title: information, technology, technologist' 'employee_position_title: enforcement, permitting, inspector' 'employee_position_title: school, health, room' 'employee_position_title: liquor, clerk, store' 'employee_position_title: community, nurse, security' 'employee_position_title: income, assistance, finance' 'employee_position_title: correctional, correction, corporal' 'employee_position_title: crossing, purchasing, background' 'employee_position_title: equipment, apprentice, facilities' 'employee_position_title: communications, telecommunications, safety' 'employee_position_title: warehouse, welfare, caseworker' 'employee_position_title: administrative, legislative, principal' 'employee_position_title: coordinator, services, service' 'employee_position_title: environmental, scientist, enviromental' 'employee_position_title: assistant, attorney, fiscal' 'employee_position_title: budget, senior, analyst' 'employee_position_title: sergeant, sheriff, deputy' 'employee_position_title: program, programs, project' 'employee_position_title: supply, transit, systems' 'date_first_hired_year' 'date_first_hired_month' 'date_first_hired_day' 'date_first_hired_total_seconds' 'year_first_hired']
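The fitted column names show that the TableVectorizer distinguishes low-cardinality from high-cardinality categorical columns and encodes each group differently. The low-cardinality columns (gender, department, department_name, assignment_category) are one-hot encoded, giving one output column per category, such as 'department_POL'. The high-cardinality columns (division and employee_position_title) are instead summarized by topic-style columns named after their most representative words, such as 'division: patrol, 6th, 4th', which is the naming produced by a GapEncoder. The date column date_first_hired is expanded into year, month, day and total seconds.
Here, fit grouped the many distinct values of each high-cardinality column into 30 topics instead of creating one column per distinct value, which limits the dimension of the encoded dataset.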
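To make the low/high cardinality split concrete, here is a minimal sketch in plain Python. It is not the TableVectorizer's actual implementation: the helper `split_by_cardinality`, the toy values and the threshold of 4 are all hypothetical, chosen only to illustrate how counting distinct values can route a column to one encoder or the other.

```python
# Hypothetical toy data (not the real dataset): column name -> string values.
toy_columns = {
    "gender": ["F", "M", "M", "F", "F", "M"],
    "department": ["POL", "FRS", "POL", "LIB", "FRS", "POL"],
    "employee_position_title": [
        "Police Officer III",
        "Firefighter/Rescuer II",
        "Police Sergeant",
        "Librarian I",
        "Master Firefighter/Rescuer",
        "Police Officer II",
    ],
}


def split_by_cardinality(columns, threshold):
    """Split column names into (low, high) cardinality groups.

    A column is 'low' when it has strictly fewer than `threshold` distinct
    values. The strict comparison is an assumption made for this sketch.
    """
    low, high = [], []
    for name, values in columns.items():
        n_unique = len(set(values))
        if n_unique < threshold:
            low.append(name)   # candidate for one-hot encoding
        else:
            high.append(name)  # candidate for a topic-style encoding
    return low, high


# The threshold of 4 is arbitrary and only suits this tiny example.
low_card, high_card = split_by_cardinality(toy_columns, threshold=4)
print("one-hot candidates:", low_card)
print("topic-encoding candidates:", high_card)
```

With these toy values, gender (2 distinct values) and department (3) fall in the low-cardinality group, while employee_position_title (6) falls in the high-cardinality group, mirroring the split visible in the fitted column names above.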